Analysis of epigenetic signals in cell-free DNA

The following sections provide details for the analyses performed in:

DNA methylation and gene expression as determinants of genome-wide cell-free DNA fragmentation

Citation

Michaël Noë, Dimitrios Mathios, Akshaya V. Annapragada, Shashikant Koul, Zacharia H. Foda, Jamie Medina, Stephen Cristiano, Christopher Cherry, Daniel C. Bruhm, Noushin Niknafs, Vilmos Adleff, Leonardo Ferreira, Hari Easwaran, Stephen Baylin, Jillian Phallen, Robert B. Scharpf, and Victor E. Velculescu. DNA methylation and gene expression as determinants of genome-wide cell-free DNA fragmentation

Abstract

Circulating cell-free DNA (cfDNA) is emerging as a diagnostic avenue for cancer detection, but the characteristics and origins of cfDNA fragmentation in the blood are poorly understood. We evaluated the effect of DNA methylation and gene expression on naturally occurring genome-wide cfDNA fragmentation through analysis of plasma from 969 individuals, including 182 with cancer. cfDNA fragment ends occurring at preferred locations genome-wide more frequently contained CCs or CGs, and fragments ending with CGs or CCGs were enriched or depleted, respectively, at methylated CpG positions, consistent with structural models showing increased interaction of methylated CG fragment ends with nucleosomes. Higher levels and larger sizes of cfDNA fragments were independently associated with regions of CpG methylation and reduced gene expression, and reflected differences in cfDNA fragmentation in tissue-specific pathways. The effects of methylation and expression on cfDNA coverage were validated by analyses of human cfDNA in mice implanted with isogenic tumors with or without the mutant IDH1 chromatin modifier. Tumor-related hypomethylation and increased gene expression were associated with global decrease in cfDNA fragment size that may explain the overall smaller cfDNA fragments observed in human cancers. Cancer-specific methylation at CpGs of pancreatic cancer patients was associated with genome-wide changes in cfDNA fragment ends in patients with cancers. These results provide a connection between epigenetic changes and cfDNA fragmentation that may have implications for disease detection.

Introduction

The analyses explained in this file use the data from previous studies (Cristiano et al., Nature, 2019 and Mathios et al., Nature Communications, 2021). The data for the samples analyzed in these studies was deposited at the database of Genotypes and Phenotypes (dbGaP) and the European Genome-Phenome Archive (EGA). The GitHub-repositories for these studies explain how the data was analyzed up to the point where a GenomicRanges (GRanges) object was made with the fragment chromosome, start- and end-positions (pre-processing in the ‘reproduce_lucas_wflow’-GitHub). These objects were saved as ‘.rds’ files. Unlike BED-files (not used here), GRanges objects store the first base of the fragment as the start-position (1-based, closed intervals), while BED-files store the position before the first base as the start-position (0-based, half-open intervals).
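The coordinate difference between the two formats can be made concrete with a minimal sketch. It is written in Python for illustration only (the repository itself works in R), and the class name, function name and example coordinates are assumptions, not part of the published code.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A cfDNA fragment in 1-based, closed (GRanges-style) coordinates."""
    chrom: str
    start: int
    end: int

    @property
    def width(self) -> int:
        # Closed interval: both the start and the end base belong to the fragment.
        return self.end - self.start + 1


def bed_to_granges(chrom: str, bed_start: int, bed_end: int) -> Fragment:
    """Convert a 0-based, half-open BED interval to 1-based, closed coordinates."""
    # The BED start is the position before the first base, so add 1.
    # The BED end already equals the 1-based position of the last base.
    return Fragment(chrom, bed_start + 1, bed_end)


# Hypothetical fragment: BED interval chr1:999-1166.
frag = bed_to_granges("chr1", 999, 1166)
print(frag, frag.width)
```

Only the start shifts by one, so the fragment length is the same under both conventions: `end - start` in BED and `end - start + 1` in GRanges.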

For the analyses explained in this file, we start from the raw data saved in ‘.rds’-files containing GRanges-objects. In order to construct the plots in the paper, we often generate temporary files which are too big or too numerous to upload to the GitHub repository. These files are stored in a folder called ‘cfepigenetics_data’, located on the same level as the folder containing the repository (cfepigenetics). The temporary files stored there are used as input for a summarizing script that generates a final summary-file, which is small enough to store in the data-folder of this GitHub repository.
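The folder layout described above can be sketched as follows. This is a Python illustration only; the helper name and the checkout path are hypothetical, and the actual scripts may build their paths differently.

```python
from pathlib import Path


def data_dir_for(repo_dir: Path) -> Path:
    """Return the sibling 'cfepigenetics_data' folder for a repository folder."""
    # The data folder sits next to the repository, not inside it.
    return repo_dir.parent / "cfepigenetics_data"


# Hypothetical checkout location; replace with the real path of the repository.
repo = Path("/home/user/projects/cfepigenetics")
print(data_dir_for(repo))
```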

Required data

The data used in this study was made publicly available with the publication of previous studies.

  • Cristiano et al., Nature, 2019: the database of Genotypes and Phenotypes (dbGaP, study ID 34536).
  • Mathios et al., Nature Communications, 2021: the European Genome-Phenome Archive (EGA, EGAS00001005340).

The methylation data of cell-free DNA used in this study was published before.

  • Moss et al., Nature Communications, 2018: the NCBI Gene Expression Omnibus (GEO, GSE122126) database repository.
  • Loyfer et al., Nature, 2023: the NCBI Gene Expression Omnibus (GEO, GSE186458) database repository.

The gene expression data of white blood cells (myeloid) used in this study was made publicly available.

Gene expression and DNA methylation data of other tumors and healthy white blood cells (Supplementary Figure 15) were downloaded from the website of The Cancer Genome Atlas (TCGA).

Pre-processing

To go from the raw sequencing-data in the ‘fastq’-files to the GRanges-objects containing the cell-free DNA fragment positions (as defined by ‘chromosome’, ‘start’ and ‘end’), we refer to the GitHub-repository of the paper by Mathios et al., Nature Communications, 2021. The code-folder of that repository contains a pre-processing-folder with the scripts to pre-process the ‘fastq’-files.

  • fastp.sh: We used ‘fastp’ to trim adapters. Trimming is needed when a cell-free DNA fragment is shorter than the number of sequencing cycles, because the read then continues into the adapter at the other end of the fragment.
  • align.sh: We used ‘Bowtie2’ to align the reads of the fastq-files to HG19.
  • post_alignment.sh: We used ‘Sambamba’ to flag duplicates and to extract cell-free DNA fragment information, like positions (as defined by ‘chromosome’, ‘start’ and ‘end’) and MAPQ-values, which are written to a BED-file.
  • bed_to_granges.sh: We used a custom R-script (01-bed_to_granges.r) to transform the BED-file into a GRanges-object, saved as a ‘.rds’-file. During this step we keep only cell-free DNA fragments with a MAPQ-value >= 30, in order to filter out fragments with low mapping quality.
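The last step above (BED-file to fragment positions, with the MAPQ filter) can be sketched as follows. This is a Python illustration, not the repository's R-script: the four-column layout, the example rows and the function name are assumptions, and the real BED-file written by post_alignment.sh may have a different layout.

```python
import csv
import io

MIN_MAPQ = 30  # threshold stated in the list above

# Hypothetical BED-like content: chrom, start (0-based), end, MAPQ.
bed_text = """chr1\t999\t1166\t42
chr1\t2000\t2150\t12
chr2\t500\t670\t30
"""


def read_fragments(handle, min_mapq=MIN_MAPQ):
    """Yield (chrom, start, end) in 1-based, closed coordinates for well-mapped fragments."""
    for chrom, start, end, mapq in csv.reader(handle, delimiter="\t"):
        # Keep fragments whose mapping quality is at least the threshold.
        if int(mapq) >= min_mapq:
            # Shift the 0-based BED start to a 1-based start; the end is unchanged.
            yield chrom, int(start) + 1, int(end)


fragments = list(read_fragments(io.StringIO(bed_text)))
print(fragments)
```

The comparison is inclusive, so a fragment with a MAPQ-value of exactly 30 is kept.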

Figure 1

  • Pre_Figure1.rmd: a step-by-step guide to the scripts that process the raw data (GRanges-objects; per sample) into intermediary files (per sample) and summarize them (all samples) into a summary-file, which is uploaded to this repository (data). This script requires the raw data (after pre-processing the data from Cristiano et al. and Mathios et al.).
  • Figure1.rmd: processes the summary-file and generates Figure 1.
  • Figure 1D contains a 3D rendering of a nucleosome, as captured from the Protein Data Bank (structure 7COW).
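The two-stage pattern used throughout this repository (per-sample intermediary files, then one small summary-file for all samples) can be sketched as follows. This is a Python illustration with in-memory data; the sample IDs, feature names and values are hypothetical placeholders and are not results from the study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-sample intermediates: sample ID -> {feature name: value}.
per_sample = {
    "sample_A": {"median_fragment_size": 167, "cg_end_fraction": 0.12},
    "sample_B": {"median_fragment_size": 165, "cg_end_fraction": 0.14},
    "sample_C": {"median_fragment_size": 169, "cg_end_fraction": 0.13},
}


def summarize(per_sample):
    """Collapse per-sample features into one mean value per feature."""
    values = defaultdict(list)
    for features in per_sample.values():
        for name, value in features.items():
            values[name].append(value)
    return {name: mean(vals) for name, vals in values.items()}


summary = summarize(per_sample)
print(summary)
```

Only the small summary needs to be stored in the repository; the per-sample intermediates stay in ‘cfepigenetics_data’.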

Figure 2

  • Pre_Figure2.rmd: a step-by-step guide to the scripts that process the raw data (GRanges-objects; per sample) into intermediary files (per sample) and summarize them (all samples) into a summary-file, which is uploaded to this repository (data). This script requires the raw data (after pre-processing the data from Cristiano et al. and Mathios et al.).
  • Figure2.rmd: processes the summary-file and generates Figure 2.

Figure 3

  • Pre_Figure3.rmd: a step-by-step guide to the scripts that process the raw data (GRanges-objects; per sample) into intermediary files (per sample) and summarize them (all samples) into a summary-file, which is uploaded to this repository (data). This script requires the raw data (after pre-processing the data from Cristiano et al. and Mathios et al.).
  • Figure3.rmd: processes the summary-file and generates Figure 3 (except Figure 3E and 3F).
  • Figure3EF.rmd: processes the summary-file and generates Figure 3E and 3F.

Figure 4

  • Pre_Figure4.rmd: a step-by-step guide to the scripts that process the raw data (GRanges-objects; per sample) into intermediary files (per sample) and summarize them (all samples) into a summary-file, which is uploaded to this repository (data). This script requires the raw data (after pre-processing the data from Cristiano et al. and Mathios et al.).
  • Pre_Figure4_Model_Ensemble.Rmd: uses the summary-file (all samples) to generate the prediction model and writes a summary-file of this model. The code uses 10-fold cross-validation; the fold has to be changed manually for each of the 10 runs (the place to change is indicated in the code).
  • Figure4.rmd: processes the summary-files and generates Figure 4.
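The 10-fold cross-validation mentioned above can be illustrated with a minimal sketch of fold assignment. This is a Python illustration under stated assumptions: the sample IDs, the seed, the round-robin assignment and the function names are hypothetical, and the model itself is not reproduced here.

```python
import random


def assign_folds(sample_ids, n_folds=10, seed=0):
    """Shuffle the samples once and assign each one to one of n_folds folds."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    # Round-robin assignment keeps fold sizes within one sample of each other.
    return {sample: i % n_folds for i, sample in enumerate(ids)}


def split_for_fold(folds, held_out):
    """Return (training, test) sample lists for the held-out fold."""
    train = sorted(s for s, f in folds.items() if f != held_out)
    test = sorted(s for s, f in folds.items() if f == held_out)
    return train, test


samples = [f"sample_{i:03d}" for i in range(25)]  # hypothetical sample IDs
folds = assign_folds(samples)

# Each pass corresponds to one manual change of the fold in the script.
for fold in range(10):
    train, test = split_for_fold(folds, fold)
    print(fold, len(train), len(test))
```

Every sample is held out exactly once across the 10 folds, so each prediction comes from a model that did not see that sample during training.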

Supplementary Figure 1

Supplementary Figure 2

This figure is a variation on Figure 2A, using different beta-value cut-offs to define ‘methylated’ and ‘unmethylated’.

  • Pre_Figure2.rmd: a step-by-step guide to the scripts that process the raw data (GRanges-objects; per sample) into intermediary files (per sample) and summarize them (all samples) into a summary-file, which is uploaded to this repository (data). This script requires the raw data (after pre-processing the data from Cristiano et al. and Mathios et al.).
  • Supplementary_Figure2.rmd: processes the summary-file and generates Supplementary Figure 2.
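The effect of changing the beta-value cut-offs can be sketched as follows. This is a Python illustration; the cut-off pairs and the beta-values are hypothetical and are not the thresholds used for the figure.

```python
def classify_cpg(beta, low=0.3, high=0.7):
    """Label a CpG by its beta-value; the default cut-offs are illustrative assumptions."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError(f"beta-value must lie in [0, 1], got {beta}")
    if beta <= low:
        return "unmethylated"
    if beta >= high:
        return "methylated"
    return "intermediate"


betas = [0.05, 0.25, 0.5, 0.75, 0.95]  # hypothetical beta-values

# Stricter cut-offs move more CpGs into the 'intermediate' class.
for low, high in [(0.1, 0.9), (0.3, 0.7)]:
    labels = [classify_cpg(b, low, high) for b in betas]
    print((low, high), labels)
```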

Supplementary Figure 3

Supplementary Figure 4

Supplementary Figure 5

Supplementary Figure 6

Supplementary Figure 7

Supplementary Figure 8

Supplementary Figure 9

Supplementary Figure 10

Supplementary Figure 11

Supplementary Figure 12

  • Pre_Figure3.rmd: generates a matrix, connecting CpG-islands (and beta-values) to the nearest transcription start site (TSS) (and gene-expression values), which is uploaded to this repository (data).
  • Figure3.rmd: populates the previously generated matrix with information about coverage, fragment-size and nucleosome positioning (data).
  • Supplementary_Figure12.rmd: processes the summary-file and generates Supplementary Figure 12.
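Connecting each CpG-island to its nearest TSS, as described for the matrix above, can be sketched for a single chromosome as follows. This is a Python illustration only: the island names, midpoints, beta-values, TSS coordinates and expression values are hypothetical, and the repository's R code may use a different matching rule (for example strand-aware distances).

```python
from bisect import bisect_left


def nearest_tss(position, tss_sorted):
    """Return the TSS closest to position (ties go to the lower coordinate)."""
    i = bisect_left(tss_sorted, position)
    # Only the neighbours directly left and right of the insertion point can be nearest.
    candidates = tss_sorted[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda tss: abs(tss - position))


# Hypothetical CpG-islands on one chromosome: (name, midpoint, beta-value).
islands = [("cgi_1", 1200, 0.85), ("cgi_2", 5200, 0.10), ("cgi_3", 8000, 0.55)]
# Hypothetical TSS coordinates with gene-expression values.
tss_expression = {1000: 0.2, 5000: 35.1, 9000: 4.7}
tss_sorted = sorted(tss_expression)

rows = []
for name, midpoint, beta in islands:
    tss = nearest_tss(midpoint, tss_sorted)
    rows.append((name, beta, tss, tss_expression[tss]))

for row in rows:
    print(row)
```

Each row pairs a beta-value with a gene-expression value, giving the kind of matrix that the later scripts can extend with coverage and fragment-size columns.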

Supplementary Figure 13

Supplementary Figure 14

Supplementary Figure 15

  • Supplementary_Figure15.rmd: processes the summary-file and generates Supplementary Figure 15.

Session information

pander::pander(sessionInfo())

R version 4.3.2 (2023-10-31)

Platform: aarch64-apple-darwin20 (64-bit)

locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

loaded via a namespace (and not attached): digest(v.0.6.33), R6(v.2.5.1), fastmap(v.1.1.1), xfun(v.0.41), cachem(v.1.0.8), knitr(v.1.45), htmltools(v.0.5.7), rmarkdown(v.2.25), lifecycle(v.1.0.4), cli(v.3.6.1), pander(v.0.6.5), sass(v.0.4.7), jquerylib(v.0.1.4), compiler(v.4.3.2), rstudioapi(v.0.15.0), tools(v.4.3.2), evaluate(v.0.23), bslib(v.0.6.0), Rcpp(v.1.0.11), yaml(v.2.3.7), rlang(v.1.1.2) and jsonlite(v.1.8.7)